Predicting Students’ Chance of Admission Using Beta Regression
Author
Marcus Chery and Keiron Green
Published
November 6, 2023
Introduction
Beta regression is a statistical methodology tailored to the challenges of modeling continuous bounded response variables, typically constrained to the open interval (0, 1). Beta regression, as introduced by (Ferrari and Cribari-Neto 2004) in their seminal paper, offers an elegant framework for modeling proportions and percentages with an emphasis on interpretable model parameters. This foundational work serves as a cornerstone for subsequent research and applications.
This report delves into the application of beta regression within the context of graduate admissions data. Our objective is to provide a thorough understanding of beta regression, its practical implementation using the “betareg” package in R, and its real-world relevance through data on students’ chances of admission. Throughout the report, we offer practical examples, code snippets, and graphs to facilitate hands-on learning. Additionally, we address the limitations of beta regression and discuss potential future research directions. The goal of this report is to equip readers with the knowledge and skills needed to effectively utilize beta regression in admissions analysis, while also fostering awareness of its boundaries and avenues for further exploration.
In beta regression, the dependent variable is assumed to follow a beta distribution, and the goal is to model its relationship with one or more independent variables. Beta regression is suitable for data where the response variable lies within the open interval (0, 1), such as proportions, percentages, or rates. It models both the mean and the dispersion of the data (Ferrari and Cribari-Neto 2004).
The beta distribution is characterized by two parameters: alpha (α) and beta (β). These parameters can be modeled as a function of predictor variables, allowing for the analysis of how independent variables influence the shape of the distribution.
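As a quick illustration (not part of the report’s analysis code), base R’s `dbeta()` shows how the two shape parameters control the distribution; the parameter values below are chosen arbitrarily:

```r
# Evaluate the beta density on a grid to see how the shape parameters
# (shape1 = alpha, shape2 = beta) change the curve.
x <- seq(0.01, 0.99, by = 0.01)

sym  <- dbeta(x, shape1 = 5, shape2 = 5)  # symmetric, bell-shaped
skew <- dbeta(x, shape1 = 2, shape2 = 8)  # right-skewed, mass near 0

# The mean of a Beta(p, q) distribution is p / (p + q)
mean_sym  <- 5 / (5 + 5)  # 0.5
mean_skew <- 2 / (2 + 8)  # 0.2
```

Plotting `sym` and `skew` against `x` makes the effect of the parameters visible at a glance.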
Like logistic regression, beta regression often uses a link function to relate the mean of the beta distribution to the linear combination of predictor variables. Common link functions include logit, probit, and log-log. The coefficients obtained from beta regression provide insights into how changes in the predictor variables affect the distribution of the response variable, including its mean and variability.
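To make the link-function idea concrete, here is a small base R sketch; the coefficients are hypothetical values chosen for illustration, not estimates from any model in this report:

```r
# A logit link maps the linear predictor eta = b0 + b1 * x, which can be
# any real number, to a mean mu constrained to (0, 1).
b0 <- -2.0   # hypothetical intercept
b1 <-  0.5   # hypothetical slope
x  <-  3.0

eta <- b0 + b1 * x   # linear predictor
mu  <- plogis(eta)   # inverse logit: exp(eta) / (1 + exp(eta))

# qlogis() is the logit itself, so the round trip recovers eta
stopifnot(abs(qlogis(mu) - eta) < 1e-12)
```

Whatever value `eta` takes, `mu` stays strictly between 0 and 1, which is exactly what a bounded response requires.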
Complementing this, the “betareg” package for R, presented by (Cribari-Neto and Zeileis 2010), provides a practical implementation of beta regression models. Building on the foundation laid by (Ferrari and Cribari-Neto 2004), the package brings beta regression into the realm of applied data analysis, and the accompanying paper showcases its utility in addressing challenges related to continuous bounded response variables.
(Guolo and Varin 2014) seek to expand on beta regression. The authors incorporate a Gaussian copula to model serial dependence in the data, making it applicable to time series analysis. The study’s practical application involves analyzing influenza-like-illness incidence data from the Google Flu Trends project. Their methodology provides a framework for analyzing bounded time series data and offers insights into inference, prediction, and control in this context.
Moreover, the paper by (Espinheira, Ferrari, and Cribari-Neto 2008) unveils innovative residuals based on Fisher’s scoring iterative algorithm, a pivotal contribution to the methodology’s practical applicability. The authors address a common challenge in statistical analysis by offering a practical route to improved model assessment and bias correction, and they demonstrate the superiority of the new residuals in various scenarios, particularly with small sample sizes.
Various papers throughout the years have sought to improve the basic beta regression techniques of (Ferrari and Cribari-Neto 2004). (Simas, Barreto-Souza, and Rocha 2010) delve into bias correction methods in the quest for improved estimators. Recognizing the potential for bias in parameter estimation, especially with small sample sizes, the paper provides closed-form expressions for the second-order biases of the maximum likelihood estimators. The authors present several bias-correction methods, including analytical approaches and bootstrap bias adjustment, enhancing the accuracy of model assessments and mitigating erroneous conclusions.
Likewise, (Schmid 2013) introduced one such enhancement, “boosted beta regression,” as a novel extension of classical beta regression that addresses complex modeling scenarios involving many predictor variables, multicollinearity, nonlinear relationships, and overdispersion. The authors utilize the gamboostLSS algorithm to estimate beta regression models efficiently, making it possible to fit such models while preserving the structure of beta regression. The method’s effectiveness is demonstrated through ecological data applications, highlighting its advantages in model fit and prediction accuracy.
(Abonazel et al. 2022) took a different approach, focusing on improving the estimation of beta regression models when multicollinearity is present among the regressors. They introduced the Dawoud–Kibria estimator as an alternative to the conventional beta maximum likelihood (BML) estimator in such situations. The authors derived the properties of this new estimator and assessed its performance through theoretical comparisons, Monte Carlo simulations, and a real-life application.
Finally, it is important to note the challenges of beta regression. (Douma and Weedon 2019) addresses one challenge of analyzing proportional data in ecology and evolution. Proportional data, common in these fields, pose unique statistical challenges due to their bounded nature and varying variances. The paper introduces beta and Dirichlet regression techniques as effective tools for analyzing such data. These methods allow for analysis at the original scale, reducing bias and simplifying interpretation. The article provides practical guidance on model fitting and reporting.
(Couri et al. 2022) focuses on the computational challenges of estimating parameters in beta regression models. The authors evaluate ten algorithms for parameter estimation and find that four, including simulated annealing, perform well. The paper also examines the optim function in R and suggests its use in challenging cases. Simulated annealing is recommended, particularly when the optim function fails. Overall, the research offers insights into improving parameter estimation for beta regression models in various applications.
Furthermore, one such limitation of classical beta regression is its handling of zero or one inflated data. (Ospina and Ferrari 2012) introduce a versatile class of regression models, zero-or-one inflated beta regression, specifically tailored to address datasets containing continuous proportions with zeros or ones. The paper offers a flexible solution for modeling mixed continuous-discrete distributions, essential in real-world scenarios such as income contributions or work-related activities. By combining the beta distribution with a degenerate distribution at zero or one, the authors provide a robust framework for effective modeling, parameter estimation, diagnostics, and model selection.
Methods
Our objective is to investigate the relationship between the given predictor variables and the response variable, which in this case is the chance of admission. Because the response is a proportion, linear regression is not appropriate: its predicted values can fall outside the bounds of (0, 1), whereas all of the observed values lie within those bounds. The suitable approach is beta regression, which we implement to predict an applicant’s chance of admission from their scores and credentials.
With the proportional response in hand, the beta regression algorithm can be implemented. Before delving into the implementation, let’s briefly review the mathematics behind the algorithm. Beta regression is based on the logit function, which maps probabilities in the interval (0, 1) to the real line: \[ \operatorname{logit}(p) = \log\!\left(\frac{p}{1-p}\right) \]
Another formula that must be accounted for in predicting the proportion values is the beta density. The beta density is especially useful for modeling probabilities because its support is the interval (0, 1). When implementing beta regression it is assumed that the response variable follows a beta distribution, hence the name beta regression.
The beta density formula is \[
f(y; p, q) = \frac{\Gamma(p + q)}{\Gamma(p)\,\Gamma(q)}\, y^{p-1} (1-y)^{q-1}, \qquad 0 < y < 1,
\] where \(p, q > 0\) and \(\Gamma(\cdot)\) is the gamma function. An alternative parameterization was proposed by (Ferrari and Cribari-Neto 2004) by setting \(\mu = p/(p+q)\) and \(\phi = p + q\).
We denote \(y \sim B(\mu, \phi)\), where \(E(y) = \mu\) and \(\operatorname{Var}(y) = \mu(1-\mu)/(1+\phi)\). The parameter \(\phi\) is known as the precision parameter: for a fixed value of \(\mu\), the variance of the response variable \(y\) decreases as \(\phi\) increases.
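The two parameterizations can be converted back and forth in a couple of lines of base R; the values of p and q below are arbitrary:

```r
# Shape parameterization (p, q) -> mean/precision parameterization (mu, phi)
p <- 6
q <- 4
mu  <- p / (p + q)   # mean, here 0.6
phi <- p + q         # precision, here 10

# And back: p = mu * phi, q = (1 - mu) * phi
stopifnot(p == mu * phi, q == (1 - mu) * phi)

# For a fixed mean, a larger precision phi means a smaller variance
var_beta <- function(mu, phi) mu * (1 - mu) / (1 + phi)
stopifnot(var_beta(0.6, 50) < var_beta(0.6, 10))
```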
Let \(y_1, \ldots, y_n\) be a random sample with \(y_i \sim B(\mu_i, \phi)\), \(i = 1, \ldots, n\). The beta regression model is defined as \[ g(\mu_i) = x_i^T \beta = \eta_i, \] where \(\beta = (\beta_1, \ldots, \beta_k)^T\) is a \(k \times 1\) vector of unknown regression parameters \((k < n)\), \(x_i = (x_{i1}, \ldots, x_{ik})^T\) is the vector of \(k\) regressors (independent variables or covariates), and \(\eta_i = \beta_1 x_{i1} + \cdots + \beta_k x_{ik}\) is the linear predictor; usually \(x_{i1} = 1\) for all \(i\) so that the model has an intercept. Here \(g(\cdot): (0,1) \to \mathbb{R}\) is the link function.
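To see this model definition in action without any packages, the following sketch simulates data from a beta regression with a logit link and recovers the parameters by maximizing the likelihood with base R’s `optim()`. This is a toy illustration with made-up true values, not the betareg fit used later in the report:

```r
set.seed(42)
n <- 500
x <- runif(n)
b_true   <- c(-1, 2)  # true intercept and slope (simulation ground truth)
phi_true <- 30        # true precision

# Simulate: mu_i = inverse-logit(eta_i), y_i ~ Beta(mu_i*phi, (1-mu_i)*phi)
mu <- plogis(b_true[1] + b_true[2] * x)
y  <- rbeta(n, shape1 = mu * phi_true, shape2 = (1 - mu) * phi_true)

# Negative log-likelihood in (b0, b1, log(phi)); log(phi) keeps phi > 0
negll <- function(par) {
  m   <- plogis(par[1] + par[2] * x)
  phi <- exp(par[3])
  -sum(dbeta(y, m * phi, (1 - m) * phi, log = TRUE))
}

fit <- optim(c(0, 0, 0), negll, method = "BFGS")
est <- c(fit$par[1:2], exp(fit$par[3]))  # should land near the true values
```

In practice the betareg package performs this maximization (plus standard errors and diagnostics) for us; the sketch only shows what is happening under the hood.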
#create scatterplot of x1 vs. y1
#plot(chance_of_admit ~ gre_score, data = admission, col = 'blue', pch = 19,
#     xlab = 'Test Scores', ylab = 'Chance of Admit',
#     main = 'Chance of Admit vs Gre Score - blue vs Toefl Score - red',
#     xlim = c(75, 350))
#add scatterplot of x2 vs. y2
#points(chance_of_admit ~ toefl_score, data = admission, col = 'red', pch = 19)
Code
#create scatterplot of chance of admit vs. TOEFL score
plot(chance_of_admit ~ toefl_score, data = admission, col = 'red', pch = 19,
     xlab = 'Toefl Score', ylab = 'Chance of Admit',
     main = 'Chance of Admit vs Toefl Score')
# Linear fit
abline(lm(admission$chance_of_admit ~ admission$toefl_score), col = "black", lwd = 3)
# Smooth fit
lines(lowess(admission$chance_of_admit ~ admission$toefl_score), col = "orange", lwd = 3)
# Legend
legend("topleft", legend = c("Linear", "Smooth"),
       lwd = 3, col = c("black", "orange"))
Code
ggplot(admission, aes(x = toefl_score, y = chance_of_admit,
                      color = as.factor(university_rating))) +
  geom_jitter(width = .2) +
  scale_color_manual(values = c("#270181", "coral", "yellow", "blue", "green")) +
  ggtitle("Chance of Admit by Toefl Score per University Ranking") +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"))
Code
ggplot(admission, aes(x = toefl_score, y = chance_of_admit,
                      color = as.factor(research))) +
  geom_jitter(width = .2) +
  scale_color_manual(values = c("#270181", "coral")) +
  ggtitle("Chance of Admit by Toefl Score per Research Experience") +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"))
Code
#create scatterplot of chance of admit vs. GRE score
plot(chance_of_admit ~ gre_score, data = admission, col = 'blue', pch = 19,
     xlab = 'GRE Score', ylab = 'Chance of Admit',
     main = 'Chance of Admit vs Gre Score')
# Linear fit
abline(lm(admission$chance_of_admit ~ admission$gre_score), col = "black", lwd = 3)
# Smooth fit
lines(lowess(admission$chance_of_admit ~ admission$gre_score), col = "orange", lwd = 3)
# Legend
legend("topleft", legend = c("Linear", "Smooth"),
       lwd = 3, col = c("black", "orange"))
Code
ggplot(admission, aes(x = gre_score, y = chance_of_admit,
                      color = as.factor(university_rating))) +
  geom_point() +
  scale_color_manual(values = c("#270181", "coral", "yellow", "blue", "green")) +
  ggtitle("Chance of Admit by Gre Score per University Ranking") +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"))
Code
ggplot(admission, aes(x = gre_score, y = chance_of_admit,
                      color = as.factor(research))) +
  geom_point() +
  scale_color_manual(values = c("#270181", "coral")) +
  ggtitle("Chance of Admit by Gre Score per Research Experience") +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"))
Code
#create scatterplot of chance of admit vs. CGPA
ggplot(admission, aes(x = cgpa, y = chance_of_admit,
                      color = as.factor(university_rating))) +
  geom_jitter(width = .2) +
  scale_color_manual(values = c("#270181", "coral", "yellow", "blue", "green")) +
  ggtitle("Chance of Admit by GPA per University Rating") +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"))
Code
library(reshape2)  # for melt()
data  <- cor(admission[sapply(admission, is.numeric)])
data1 <- melt(data)
ggplot(data1, aes(x = Var1, y = Var2, fill = value)) +
  coord_fixed() +
  geom_tile() +
  scale_fill_distiller(palette = "Spectral") +
  theme(axis.text.x = element_text(angle = 90, vjust = .5, hjust = 1)) +
  labs(title = "Admission Correlation Heatmap", x = " ", y = " ") +
  geom_text(aes(label = round(value, digits = 2)), color = "black", size = 3) +
  theme(plot.title = element_text(color = "black", size = 14, face = "bold"))
Analysis and Results
Code
# Scaling of data (not applied)
#admission <- scale(admission)
#admission <- as.data.frame(admission)
Code
# Splitting the data
# Make this example reproducible
set.seed(1)
# Use 70% of the dataset as the training set and the remaining 30% as the test set
sample <- sample(c(TRUE, FALSE), nrow(admission), replace = TRUE, prob = c(0.7, 0.3))
train <- admission[sample, ]
test  <- admission[!sample, ]
# View dimensions of the training and test sets
#dim(train)
#dim(test)
X_train <- train
X_test  <- subset(test, select = -c(chance_of_admit))
keeps   <- c("chance_of_admit")
y_test  <- test[keeps]
Code
library(betareg)
gy_logit <- betareg(chance_of_admit ~ serial_no + gre_score + toefl_score +
                      university_rating + sop + lor + cgpa + research + sop*lor,
                    data = train)
summary(gy_logit)
Call:
betareg(formula = chance_of_admit ~ serial_no + gre_score + toefl_score +
university_rating + sop + lor + cgpa + research + sop * lor, data = train)
Standardized weighted residuals 2:
Min 1Q Median 3Q Max
-3.4463 -0.4863 0.1052 0.6737 2.4658
Coefficients (mean model with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.4214227 0.7593622 -12.407 < 2e-16 ***
serial_no 0.0007082 0.0001744 4.061 4.88e-05 ***
gre_score 0.0090894 0.0033019 2.753 0.005909 **
toefl_score 0.0245819 0.0062882 3.909 9.26e-05 ***
university_rating 0.0491160 0.0293915 1.671 0.094702 .
sop -0.2410749 0.0720240 -3.347 0.000816 ***
lor -0.0799725 0.0740654 -1.080 0.280251
cgpa 0.5700814 0.0686208 8.308 < 2e-16 ***
research 0.1508322 0.0444278 3.395 0.000686 ***
sop:lor 0.0637809 0.0209409 3.046 0.002321 **
Phi coefficients (precision model with identity link):
Estimate Std. Error z value Pr(>|z|)
(phi) 52.165 4.329 12.05 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Type of estimator: ML (maximum likelihood)
Log-likelihood: 422.3 on 11 Df
Pseudo R-squared: 0.845
Number of iterations: 21 (BFGS) + 2 (Fisher scoring)
Code
library(leaps)
models <- regsubsets(chance_of_admit ~ ., data = admission, nvmax = 9,
                     method = "seqrep")
summary(models)
library(caret)  # for trainControl() and train()
# Set seed for reproducibility
set.seed(123)
# Set up k-fold cross-validation
train.control <- trainControl(method = "cv", number = 10)
# Train the model
step.model <- train(chance_of_admit ~ ., data = admission,
                    method = "leapSeq", tuneGrid = data.frame(nvmax = 1:8),
                    trControl = train.control)
step.model$results
#Predictions on the test set
#predictions <- predict(gy_logit, test[-9], type = "response")
#lrtest(gy_logit)
#AIC(gy_logit)
#vif(gy_logit)
#Anova(gy_logit3)
Code
gy_logit <- betareg(chance_of_admit ~ serial_no + gre_score + toefl_score +
                      university_rating + lor + cgpa + research, data = train)
AIC(gy_logit)
[1] -815.3287
Code
gy_logit$pseudo.r.squared
[1] 0.8357231
Code
gy_logit1 <- betareg(chance_of_admit ~ serial_no + gre_score + toefl_score +
                       university_rating + lor + cgpa, data = train)
AIC(gy_logit1)
[1] -807.4456
Code
gy_logit1$pseudo.r.squared
[1] 0.8296896
Code
gy_logit2 <- betareg(chance_of_admit ~ serial_no + gre_score + toefl_score +
                       university_rating + lor + research, data = train)
AIC(gy_logit2)
gy_logit4 <- betareg(chance_of_admit ~ serial_no + gre_score + toefl_score +
                       lor + cgpa + research, data = train)
AIC(gy_logit4)
[1] -813.6016
Code
gy_logit4$pseudo.r.squared
[1] 0.8329967
Code
gy_logit5 <- betareg(chance_of_admit ~ serial_no + gre_score + university_rating +
                       lor + cgpa + research, data = train)
AIC(gy_logit5)
[1] -805.1029
Code
gy_logit5$pseudo.r.squared
[1] 0.8269614
Code
gy_logit6 <- betareg(chance_of_admit ~ serial_no + toefl_score + university_rating +
                       lor + cgpa + research, data = train)
AIC(gy_logit6)
[1] -810.2971
Code
gy_logit6$pseudo.r.squared
[1] 0.8320949
Code
gy_logit7 <- betareg(chance_of_admit ~ gre_score + toefl_score + university_rating +
                       lor + cgpa + research, data = train)
AIC(gy_logit7)
[1] -798.5518
Code
gy_logit7$pseudo.r.squared
[1] 0.8262585
Code
predictions <- predict(gy_logit, test[-9], type = "response")
df <- data.frame(predictions, y_test)
ggplot(df, aes(x = chance_of_admit, y = predictions)) +
  geom_point(color = "red") +
  geom_abline(intercept = 0, slope = 1, color = "yellow", lwd = 1) +
  labs(y = 'Predicted Values', x = 'Actual Values',
       title = 'Predicted vs. Actual Values') +
  theme_bw()
Code
predictions <- predict(gy_logit, test[-9], type = "response")
df <- data.frame(predictions, y_test)
ggplot(df, aes(x = 1:nrow(df))) +
  geom_line(aes(y = predictions), color = "darkred") +
  geom_line(aes(y = chance_of_admit), color = "steelblue") +
  labs(y = " ", x = "Number of rows",
       title = 'Predicted Values - red vs. Actual Values - blue') +
  theme_bw()
Code
predictions <- predict(gy_logit4, test[-9], type = "response")
df <- data.frame(predictions, y_test)
ggplot(df, aes(x = chance_of_admit, y = predictions)) +
  geom_point(color = "blue") +
  geom_abline(intercept = 0, slope = 1, color = "green", lwd = 1) +
  labs(y = 'Predicted Values', x = 'Actual Values',
       title = 'Predicted vs. Actual Values') +
  theme_bw()
Code
# Diagnostic plots for the fitted model
plot(gy_logit, main = "Final Model")
Code
plot(fitted(gy_logit), residuals(gy_logit))
Conclusion
References
Abonazel, Mohamed R., Issam Dawoud, Fuad A. Awwad, and Adewale F. Lukman. 2022. “Dawoud–Kibria Estimator for Beta Regression Model: Simulation and Application.” Frontiers in Applied Mathematics and Statistics 8. https://doi.org/10.3389/fams.2022.775068.
Couri, Lucas, Raydonal Ospina, Geiza da Silva, Víctor Leiva, and Jorge Figueroa-Zúñiga. 2022. “A Study on Computational Algorithms in the Estimation of Parameters for a Class of Beta Regression Models.” Mathematics 10 (3): 299. https://doi.org/10.3390/math10030299.
Cribari-Neto, Francisco, and Achim Zeileis. 2010. “Beta Regression in R.” Journal of Statistical Software 34 (2): 1–24. https://doi.org/10.18637/jss.v034.i02.
Douma, Jacob C., and James T. Weedon. 2019. “Analysing Continuous Proportions in Ecology and Evolution: A Practical Introduction to Beta and Dirichlet Regression.” Methods in Ecology and Evolution 10 (9): 1412–30. https://doi.org/10.1111/2041-210X.13234.
Espinheira, Patrícia L., Silvia L. P. Ferrari, and Francisco Cribari-Neto. 2008. “On Beta Regression Residuals.” Journal of Applied Statistics 35 (4): 407–19. https://doi.org/10.1080/02664760701834931.
Guolo, Annamaria, and Cristiano Varin. 2014. “Beta Regression for Time Series Analysis of Bounded Data, with Application to Canada Google® Flu Trends.” The Annals of Applied Statistics 8 (1): 74–88. https://doi.org/10.1214/13-AOAS684.
Ospina, Raydonal, and Silvia L. P. Ferrari. 2012. “A General Class of Zero-or-One Inflated Beta Regression Models.” Computational Statistics & Data Analysis 56 (6): 1609–23. https://doi.org/10.1016/j.csda.2011.10.005.
Simas, Alexandre B., Wagner Barreto-Souza, and Andréa V. Rocha. 2010. “Improved Estimators for a General Class of Beta Regression Models.” Computational Statistics & Data Analysis 54 (2): 348–66. https://ideas.repec.org/a/eee/csdana/v54y2010i2p348-366.html.